Objectives

  • Randomisation and basic statistics
  • Statistical hypothesis testing: t-test
  • Sample size calculation
  • Analysis for categorical data
  • Linear regression and correlation

Part 1: Basic statistics

Randomisation

Random selection of samples from a larger set

Let’s assume that we have the population with a total of 10 subjects. Suppose we label them from 1 to 10 and randomly would like to select 3 subjects we can do this using the sample function. When we run sample another time, different subjects will be selected. Try this a couple times.

sample(10, 3)
## [1] 8 1 9
sample(10, 3)
## [1]  3  9 10

Now suppose we would like to select the same randomly selected samples every time, then we can use a random seed number.

set.seed(3728)
sample(10, 3)
## [1] 5 8 7
set.seed(3728)
sample(10, 3)
## [1] 5 8 7

Let’s practice with fun example. Select two in our group member for coming early next Monday.

group.member <- c('Meena', 'Tsung-Heng', 'Ting', 'April', 'Dan', 'Cyril', 'Kylie', 'Sara')
sample(group.member, 2)
## [1] "Dan"   "Cyril"

Completely randomized order of MS runs

We can also create a random order using all elements of iPRG dataset. Again, we can achieve this using sample, asking for exactly the amount of samples in the subset. This time, each repetition gives us a different order of the complete set.

msrun <- unique(iprg$Run)
## Error in unique(iprg$Run): object 'iprg' not found
msrun
## Error in eval(expr, envir, enclos): object 'msrun' not found
## randomize order among all 12 MS runs
sample(msrun, length(msrun))
## Error in sample(msrun, length(msrun)): object 'msrun' not found
## different order will be shown.
sample(msrun, length(msrun))
## Error in sample(msrun, length(msrun)): object 'msrun' not found

Randomized block design

  • Allow to remove known sources of variability that you are not interested in.

  • Group conditions into blocks such that the conditions in a block are as similar as possible.

  • Randomly assign samples with a block.

This particular dataset contains a total of 12 MS runs across 4 conditions, 3 technical replicates per condition. Using the block.random function in the psych package, we can achieve randomized block designs! block.random function makes random assignment of n subjects with an equal number in all of N conditions.

library("psych") ## load the psych package

msrun <- unique(iprg[, c('Condition','Run')])
## Error in unique(iprg[, c("Condition", "Run")]): object 'iprg' not found
msrun
## Error in eval(expr, envir, enclos): object 'msrun' not found
## 4 Conditions of 12 MS runs randomly ordered
block.random(n = 12, c(Condition = 4))
##     blocks Condition
## S1       1         2
## S2       1         4
## S3       1         3
## S4       1         1
## S5       2         1
## S6       2         2
## S7       2         4
## S8       2         3
## S9       3         4
## S10      3         3
## S11      3         2
## S12      3         1
block.random(n = 12, c(Condition = 4, BioReplicate=3))
##     blocks Condition BioReplicate
## S1       1         1            3
## S2       1         1            2
## S3       1         4            2
## S4       1         2            2
## S5       1         3            2
## S6       1         2            3
## S7       1         1            1
## S8       1         4            1
## S9       1         3            3
## S10      1         3            1
## S11      1         2            1
## S12      1         4            3

Basic statistical summaries

Calculate simple statistics

Let’s start data with one protein as an example and calculate the mean, standard deviation, standard error of the mean across all replicates per condition. We then store all the computed statistics into a single summary data frame for easy access.

We can use the aggregate function to compute summary statistics. aggregate splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.

# check what proteins are in dataset, show all protein names
head(unique(iprg$Protein))
## Error in unique(iprg$Protein): object 'iprg' not found
# Let's start with one protein, named "sp|P44015|VAC2_YEAST"
oneproteindata <- iprg[iprg$Protein == "sp|P44015|VAC2_YEAST", ]
## Error in eval(expr, envir, enclos): object 'iprg' not found
# there are 12 rows in oneproteindata
oneproteindata
## Error in eval(expr, envir, enclos): object 'oneproteindata' not found
# If you want to see more details,
?aggregate

Calculate mean per groups

## splits 'oneproteindata' into subsets by 'Condition',
## then, compute 'FUN=mean' of 'log2Int'
sub.mean <- aggregate(Log2Intensity ~ Condition,
                      data = oneproteindata,
                      FUN = mean)
## Error in eval(m$data, parent.frame()): object 'oneproteindata' not found
sub.mean
## Error in eval(expr, envir, enclos): object 'sub.mean' not found

Calculate SD (standard deviation) per groups

\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2} \]

Challenge

Using the aggregate function above, calculate the standard deviation, by applying the median function.

## The same as mean calculation above. 'FUN' is changed to 'sd'.
sub.median <- aggregate(Log2Intensity ~ Condition,
                    data = oneproteindata, FUN = median)
## Error in eval(m$data, parent.frame()): object 'oneproteindata' not found
sub.median
## Error in eval(expr, envir, enclos): object 'sub.median' not found

Using the aggregate function above, calculate the standard deviation, by applying the sd function.

## The same as mean calculation above. 'FUN' is changed to 'sd'.
sub.sd <- aggregate(Log2Intensity ~ Condition,
                    data = oneproteindata, FUN = sd)
## Error in eval(m$data, parent.frame()): object 'oneproteindata' not found
sub.sd
## Error in eval(expr, envir, enclos): object 'sub.sd' not found

Count the number of observation per groups

Challenge

Using the aggregate function above, count the number of observations per group with the length function.

## The same as mean calculation. 'FUN' is changed 'length'.
sub.len <- aggregate(Log2Intensity ~ Condition,
                     data = oneproteindata,
                     FUN = length)
## Error in eval(m$data, parent.frame()): object 'oneproteindata' not found
sub.len
## Error in eval(expr, envir, enclos): object 'sub.len' not found

Calculate SE (standard error of mean) per groups

\[ SE = \sqrt{\frac{s^2}{n}} \]

sub.se <- sqrt(sub.sd$Log2Intensity^2 / sub.len$Log2Intensity)
## Error in eval(expr, envir, enclos): object 'sub.sd' not found
sub.se
## Error in eval(expr, envir, enclos): object 'sub.se' not found

We can now make the summary table including the results above (mean, sd, se and length).

## paste0 : concatenate vectors after convering to character.
(grp <- paste0("Condition", 1:4))
## [1] "Condition1" "Condition2" "Condition3" "Condition4"
## It is equivalent to paste("Condition", 1:4, sep="")
summaryresult <- data.frame(Group = grp,
                            mean = sub.mean$Log2Intensity,
                            sd = sub.sd$Log2Intensity,
                            se = sub.se,
                            length = sub.len$Log2Intensity)
## Error in data.frame(Group = grp, mean = sub.mean$Log2Intensity, sd = sub.sd$Log2Intensity, : object 'sub.mean' not found
summaryresult
## Error in eval(expr, envir, enclos): object 'summaryresult' not found

Visualization with error bars for descriptive purpose

error bars can have a variety of meanings or conclusions if what they represent is not precisely specified. Below we provide some examples of which types of error bars are common. We’re using the summary of protein sp|P44015|VAC2_YEAST from the previous section and the ggplot2 package as it provides a convenient way to make easily adaptable plots.

# means without any errorbar
p <- ggplot(aes(x = Group, y = mean, colour = Group),
            data = summaryresult)+
    geom_point(size = 3)
## Error in ggplot(aes(x = Group, y = mean, colour = Group), data = summaryresult): could not find function "ggplot"
p
## Error in eval(expr, envir, enclos): object 'p' not found

Let’s change a number of visual properties to make the plot more attractive.

  • Let’s change the labels of x-axis and y-axis and title: labs(title="Mean", x="Condition", y='Log2(Intensity)')
  • Let’s change background color for white: theme_bw()
  • Let’s change size or color of labels of axes and title, text of x-axis by using a theme
  • Let’s change the position of legend (use 'none' to remove it)
  • Let’s make the box for legend
  • Let’s remove the box for legend key.
p2 <- p + labs(title = "Mean", x = "Group", y = 'Log2(Intensity)') +
    theme_bw() +
    theme(plot.title = element_text(size = 25, colour = "darkblue"),
          axis.title.x = element_text(size = 15),
          axis.title.y = element_text(size = 15),
          axis.text.x = element_text(size = 13),
          legend.position = 'bottom',
          legend.background = element_rect(colour = 'black'),
          legend.key = element_rect(colour = 'white'))
## Error in eval(expr, envir, enclos): object 'p' not found
p2
## Error in eval(expr, envir, enclos): object 'p2' not found

Let’s now add the standard deviation:

# mean with SD
p2 + geom_errorbar(aes(ymax = mean + sd, ymin = mean - sd), width = 0.1) +
      labs(title="Mean with SD")
## Error in eval(expr, envir, enclos): object 'p2' not found

Challenge

Add the standard error of the mean. Which one is smaller?

# mean with SE
p2 + geom_errorbar(aes(ymax = mean + se, ymin=mean - se), width = 0.1) +
    labs(title="Mean with SE")
## Error in eval(expr, envir, enclos): object 'p2' not found
## The SE is narrow than the SD!

Challenge

Add the standard error of the mean with black color.

# mean with SE
p2 + geom_errorbar(aes(ymax = mean + se, ymin=mean - se), width = 0.1, color='black') +
    labs(title="Mean with SE")
## Error in eval(expr, envir, enclos): object 'p2' not found

Working with statistical distributions

For each statistical distribution, we have function to compute

  • density
  • distribution function
  • quantile function
  • random generation

For the normale distribution norm, these are respectively

  • dnorm
  • pnorm
  • qnorm
  • rnorm

Let’s start by sampling 1000000 values from a normal distribution \(N(0, 1)\):

xn <- rnorm(1e6)
hist(xn, freq = FALSE)
rug(xn)
lines(density(xn), lwd = 2)

plot of chunk unnamed-chunk-20

By definition, the area under the density curve is 1. The area at the left of 0, 1, and 2 are respectively:

pnorm(0)
## [1] 0.5
pnorm(1)
## [1] 0.8413447
pnorm(2)
## [1] 0.9772499

To ask the inverse question, we use the quantile function. The obtain 0.5, 0.8413447 and 0.9772499 of our distribution, we need means of:

qnorm(0.5)
## [1] 0
qnorm(pnorm(1))
## [1] 1
qnorm(pnorm(2))
## [1] 2

Finally, the density function gives us the height at which we are for a given mean:

hist(xn, freq = FALSE)
lines(density(xn), lwd = 2)
points(0, dnorm(0), pch = 19, col = "red")
points(1, dnorm(1), pch = 1, col = "blue")
points(2, dnorm(2), pch = 4, col = "orange")

plot of chunk unnamed-chunk-23

Calculate the confidence interval

Now that we’ve covered the standard error of the mean and the standard deviation, let’s investigate how we can add custom confidence intervals (CI) for our measurement of the mean. We’ll add these CI’s to the summary results we previously stored for protein sp|P44015|VAC2_YEAST.

Confidence interval:

\[\mbox{mean} \pm (SE \times \frac{\alpha}{2} ~ \mbox{quantile of t distribution})\]

To calculate the 95% confident interval, we need to be careful and set the quantile for two-sided. We need to divide by two for error. For example, 95% confidence interval, right tail is 2.5% and left tail is 2.5%.

summaryresult$ciw.lower.95 <- summaryresult$mean -
    qt(0.975, summaryresult$len - 1) * summaryresult$se
## Error in eval(expr, envir, enclos): object 'summaryresult' not found
summaryresult$ciw.upper.95 <- summaryresult$mean +
    qt(0.975, summaryresult$len - 1) * summaryresult$se
## Error in eval(expr, envir, enclos): object 'summaryresult' not found
summaryresult
## Error in eval(expr, envir, enclos): object 'summaryresult' not found
# mean with 95% two-sided confidence interval
ggplot(aes(x = Group, y = mean, colour = Group),
       data = summaryresult) +
    geom_point() +
    geom_errorbar(aes(ymax = ciw.upper.95, ymin = ciw.lower.95), width = 0.1) +
    labs(title="Mean with 95% confidence interval", x="Condition", y='Log2(Intensity)') +
    theme_bw() +
    theme(plot.title = element_text(size=25, colour="darkblue"),
          axis.title.x = element_text(size=15),
          axis.title.y = element_text(size=15),
          axis.text.x = element_text(size=13),
          legend.position = 'bottom',
          legend.background = element_rect(colour = 'black'),
          legend.key = element_rect(colour='white'))
## Error in ggplot(aes(x = Group, y = mean, colour = Group), data = summaryresult): could not find function "ggplot"

Challenges

Replicate the above for the 99% two-sided confidence interval.

# mean with 99% two-sided confidence interval
summaryresult$ciw.lower.99 <- summaryresult$mean - qt(0.995,summaryresult$len-1) * summaryresult$se
## Error in eval(expr, envir, enclos): object 'summaryresult' not found
summaryresult$ciw.upper.99 <- summaryresult$mean + qt(0.995,summaryresult$len-1) * summaryresult$se
## Error in eval(expr, envir, enclos): object 'summaryresult' not found
summaryresult
## Error in eval(expr, envir, enclos): object 'summaryresult' not found
ggplot(aes(x = Group, y = mean, colour = Group),
       data = summaryresult) +
    geom_point() +
    geom_errorbar(aes(ymax = ciw.upper.99, ymin=ciw.lower.99), width=0.1) +
    labs(title="Mean with 99% confidence interval", x="Condition", y='Log2(Intensity)') +
    theme_bw()+
    theme(plot.title = element_text(size=25, colour="darkblue"),
          axis.title.x = element_text(size=15),
          axis.title.y = element_text(size=15),
          axis.text.x = element_text(size=13),
          legend.position = 'bottom',
          legend.background = element_rect(colour='black'),
          legend.key = element_rect(colour='white'))
## Error in ggplot(aes(x = Group, y = mean, colour = Group), data = summaryresult): could not find function "ggplot"

Some comments

  • Error bars with SD and CI are overlapping between groups!

  • Error bars for the SD show the spread of the population while error bars based on SE reflect the uncertainty in the mean and depend on the sample size.

  • Confidence intervals of n on the other hand mean that the intervals capture the population mean n percent of the time.

  • When the sample size increases, CI and SE are getting closer to each other.

Saving our results

We have two objects that contain all the information that we have generated so far:

  • The summaryresults object, that contains all the summary statistics.
  • The iprg data frame, that was read from the csv file. This object can be easily regenerated using read.csv, and hence doesn’t necessarily to be saved explicity.
save(summaryresult, file = "./data/summaryresults.rda")
## Error in save(summaryresult, file = "./data/summaryresults.rda"): object 'summaryresult' not found
save(iprg, file = "./data/iprg.rda")
## Error in save(iprg, file = "./data/iprg.rda"): object 'iprg' not found

We can also save the summary result as a csv file using the write.csv function:

write.csv(sumamryresult, file = "./data/summary.csv")

Tip: Exporting to csv is useful to share your work with collaborators that do not use R, but for wany continous work in R, to assure data validity accords platforms, the best format is rda.

Part 2: Statistical hypothesis test

First, we are going to prepare the session for further analyses.

load("./data/summaryresults.rda")
load("./data/iprg.rda")

Two sample t-test for one protein with one feature

Now, we’ll perform a t-test whether protein sp|P44015|VAC2_YEAST has a change in abundance between Condition 1 and Condition 2.

Hypothesis

  • \(H_0\): no change in abundance, mean(Condition1) - mean(Condition2) = 0
  • \(H_a\): change in abundance, mean(Condition1) - mean(Condition 2) \(\neq\) 0

Statistics

  • Observed \(t = \frac{\mbox{difference of group means}}{\mbox{estimate of variation}} = \frac{(mean_{1} - mean_{2})}{SE} \sim t_{\alpha/2, df}\)
  • Standard error, \(SE=\sqrt{\frac{s_{1}^2}{n_{1}} + \frac{s_{2}^2}{n_{2}}}\)

with

  • \(n_{i}\): number of replicates
  • \(s_{i}^2 = \frac{1}{n_{i}-1} \sum (Y_{ij} - \bar{Y_{i \cdot}})^2\): sample variance

Data preparation

## Let's start with one protein, named "sp|P44015|VAC2_YEAST"
oneproteindata <- iprg[iprg$Protein == "sp|P44015|VAC2_YEAST", ]

## Then, get two conditions only, because t.test only works for two groups (conditions).
oneproteindata.condition12 <- oneproteindata[oneproteindata$Condition %in%
                                             c('Condition1', 'Condition2'), ]
oneproteindata.condition12
##                    Protein Log2Intensity                       Run
## 21096 sp|P44015|VAC2_YEAST      26.30163 JD_06232014_sample1_B.raw
## 21097 sp|P44015|VAC2_YEAST      26.11643 JD_06232014_sample1_C.raw
## 21098 sp|P44015|VAC2_YEAST      26.29089 JD_06232014_sample1-A.raw
## 21099 sp|P44015|VAC2_YEAST      25.81957 JD_06232014_sample2_A.raw
## 21100 sp|P44015|VAC2_YEAST      26.11527 JD_06232014_sample2_B.raw
## 21101 sp|P44015|VAC2_YEAST      26.08498 JD_06232014_sample2_C.raw
##        Condition BioReplicate Intensity TechReplicate
## 21096 Condition1            1  82714388             B
## 21097 Condition1            1  72749239             C
## 21098 Condition1            1  82100518             A
## 21099 Condition2            2  59219741             A
## 21100 Condition2            2  72690802             B
## 21101 Condition2            2  71180513             C
table(oneproteindata.condition12[, c("Condition", "BioReplicate")])
##             BioReplicate
## Condition    1 2
##   Condition1 3 0
##   Condition2 0 3

If we want to remove the levels that are not relevant anymore, we can use droplevels:

oneproteindata.condition12 <- droplevels(oneproteindata.condition12)
table(oneproteindata.condition12[, c("Condition", "BioReplicate")])
##             BioReplicate
## Condition    1 2
##   Condition1 3 0
##   Condition2 0 3

To perform the t-test, we use the t.test function. Let’s first familiarise ourselves with it by looking that the manual

?t.test

And now apply to to our data

# t test for different abundance (log2Int) between Groups (Condition)
result <- t.test(Log2Intensity ~ Condition,
                 data = oneproteindata.condition12,
                 var.equal = FALSE)

result
## 
##  Welch Two Sample t-test
## 
## data:  Log2Intensity by Condition
## t = 2.0608, df = 3.4001, p-value = 0.1206
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1025408  0.5619598
## sample estimates:
## mean in group Condition1 mean in group Condition2 
##                 26.23632                 26.00661

Challenge

Repeat the t-test above but with calculating a 90% confidence interval for the log2 fold change.

The htest class

The t.test function, like other hypothesis testing function, return a result of a type we haven’t encountered yet, the htest class:

class(result)
## [1] "htest"

which stores typical results from such tests. Let’s have a more detailed look at what information we can learn from the results our t-test. When we type the name of our result object, we get a short textual summary, but the object contains more details:

names(result)
## [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
## [6] "null.value"  "alternative" "method"      "data.name"

and we can access each of these by using the $ operator, like we used to access a single column from a data.frame, but the htest class is not a data.frame (it’s actually a list). For example, to access the group means, we would use

result$estimate
## mean in group Condition1 mean in group Condition2 
##                 26.23632                 26.00661

Challenge

  • Calculate the (log2-transformed) fold change between groups
  • Extract the value of the t-statistics
  • Calculate the standard error (fold-change/t-statistics)
  • Extract the degrees of freedom (parameter)
  • Extract the p values
  • Extract the 95% confidence intervals

We can also manually compute our t-test statistic using the formulas we descibed above and compare it with the summaryresult.

Recall the summaryresult we generated last section.

summaryresult
##        Group     mean         sd         se length ciw.lower.95
## 1 Condition1 26.23632 0.10396539 0.06002444      3     25.97805
## 2 Condition2 26.00661 0.16268179 0.09392438      3     25.60248
## 3 Condition3 23.25609 0.09467798 0.05466236      3     23.02090
## 4 Condition4 20.97056 0.73140174 0.42227499      3     19.15366
##   ciw.upper.95 ciw.lower.99 ciw.upper.99
## 1     26.49458     25.64058     26.83205
## 2     26.41073     25.07442     26.93879
## 3     23.49128     22.71357     23.79860
## 4     22.78746     16.77955     25.16157
summaryresult12 <- summaryresult[1:2, ]

## test statistic, It is the same as 'result$statistic' above.
diff(summaryresult12$mean) ## different sign, but absolute values are same as result$estimate[1]-result$estimate[2]
## [1] -0.2297095
sqrt(sum(summaryresult12$sd^2/summaryresult12$length)) ## same as stand error
## [1] 0.1114662
## the t-statistic : sign is different
diff(summaryresult12$mean)/sqrt(sum(summaryresult12$sd^2/summaryresult12$length))
## [1] -2.060799

Re-calculating the p values

See the previous Working with statistical distributions for a reminder.

Referring back to our t-test results above, we can manually calculate the one- and two-side tests p-values using the t-statistics and the test parameter (using the pt function).

Our result t statistic was 2.0607988 (accessible with result$statistic). Let’s start by visualising it along a t distribution. Let’s create data from such a distribution, making sure we set to appropriate parameter.

## generate 10^5 number with the same degree of freedom for distribution.
xt <- rt(1e5, result$parameter)
plot(density(xt), xlim = c(-10, 10))
abline(v = result$statistic, col = "red") ## where t statistics are located.
abline(h = 0, col = "gray") ## horizontal line at 0

plot of chunk unnamed-chunk-40

The area on the left of that point is given by pt(result$statistic, result$parameter), which is 0.939693. The p-value for a one-sided test, which is ** the area on the right** of red line, is this given by

1 - pt(result$statistic, result$parameter)
##          t 
## 0.06030697

And the p-value for a two-sided test is

2 * (1 - pt(result$statistic, result$parameter))
##         t 
## 0.1206139

which is the same as the one calculated by the t-test.


Part 3a: Sample size calculation

To calculate the required sample size, you’ll need to know four things:

  • \(\alpha\): confidence level
  • \(power\): 1 - \(\beta\), where \(\beta\) is probability of a true positive discovery
  • \(\Delta\): anticipated fold change
  • \(\sigma\): anticipated variance

R code

Assuming equal varaince and number of samples across groups, the following formula is used for sample size estimation:

\[\frac{2{\sigma}^2}{n}\leq(\frac{\Delta}{z_{1-\beta}+z_{1-\alpha/2}})^2\]

library("pwr")

## ?pwr.t.test

# Significance level alpha
alpha <- 0.05

# Power = 1 - beta
power <- 0.95

# anticipated log2 fold change
delta <- 1

# anticipated variability
sigma <- 0.9

# Effect size
# It quantifies the size of the difference between two groups
d <- delta/sigma

#Sample size estimation
pwr.t.test(d = d, sig.level = alpha, power = power, type = 'two.sample')
## 
##      Two-sample t test power calculation 
## 
##               n = 22.06036
##               d = 1.111111
##       sig.level = 0.05
##           power = 0.95
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Challenge

  • Calculate power with 10 samples and the same parameters as above.

Let’s investigate the effect of required fold change and variance on the sample size estimation.

# anticipated log2 fold change
delta <- seq(0.1, 0.7, .1)
nd <- length(delta)

# anticipated variability
sigma <- seq(0.1,0.5,.1)
ns <- length(sigma)

# obtain sample sizes
samsize <- matrix(0, nrow=ns*nd, ncol = 3)
counter <- 0
for (i in 1:nd){
  for (j in 1:ns){
    result <- pwr.t.test(d = delta[i]/sigma[j],
                         sig.level = alpha,
                         power = power,
                         type = "two.sample")
    counter <- counter + 1
    samsize[counter,1] <- delta[i]
    samsize[counter,2] <- sigma[j]
    samsize[counter,3] <- ceiling(result$n)
  }
}
colnames(samsize) <- c("desiredlog2FC","variability","samplesize")


library("ggplot2")
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
samsize <- as.data.frame(samsize)
samsize$variability <- as.factor(samsize$variability)
ggplot(data=samsize, aes(x=desiredlog2FC, y=samplesize, group = variability, colour = variability)) +
  geom_line() +
  geom_point(size=2, shape=21, fill="white") +
  labs(title="Significance level=0.05, Power=0.95", x="Anticipated log2 fold change", y='Sample Size (n)') +
  theme(plot.title = element_text(size=20, colour="darkblue"),
        axis.title.x = element_text(size=15),
        axis.title.y = element_text(size=15),
        axis.text.x = element_text(size=13))

plot of chunk unnamed-chunk-45


Part 3b: Choosing a model

The decision of which statistical model is appropriate for a given set of observations depends on the type of data that have been collected.

  • Quantitative response with quantitative predictors : regression model

  • Categorical response with quantitative predictors : logistic regression model for bivariate categorical response (e.g., Yes/No, dead/alive), multivariate logistic regression model when the response variable has more than two possible values.

  • Quantitative response with categorical predictors : ANOVA model (quantitative response across several populations defined by one or more categorical predictor variables)

  • Categorical response with categorical predictors : contingency table that can be used to draw conclusions about the relationships between variables.

Part 5b are adapted from Bremer & Doerge, Using R at the Bench : Step-by-Step Data Analytics for Biologists , cold Spring Harbor LaboratoryPress, 2015.


Part 3c: Analysis of categorical data

For this part, we are going to use a new dataset, which contains the patient information from TCGA colorectal cohort. This data is from Proteogenomic characterization of human colon and rectal cancer (Zhang et al. 2014).

Let’s read in and explore the TCGA colorectal cancer data:

TCGA.CRC <- read.csv("./data/TCGA_sample_information.csv")
head(TCGA.CRC)
##   TCGA.participant.ID Gender Cancer BRAF.mutation history_of_colon_polyps
## 1        TCGA-A6-3807 Female  Colon             0                      NO
## 2        TCGA-A6-3808   Male  Colon             0                     YES
## 3        TCGA-A6-3810   Male  Colon             0                     YES
## 4        TCGA-AA-3518 Female  Colon             0                      NO
## 5        TCGA-AA-3525   Male  Colon             1                      NO
## 6        TCGA-AA-3526   Male  Colon             0                     YES

Rows in the data array are patients and columns are patient information. The column definition is shown as following:

Variable
TCGA participant ID
Gender
Cancer type
BTAF mutation status
History of colon polyps

Let’s be familiar with data first.

## check the dimension of data
dim(TCGA.CRC)
## [1] 78  5
dim(TCGA.CRC$Gender) # dim function is for matrix, array or data frame.
## NULL
## check unique information of Gender column.
unique(TCGA.CRC$Gender)
## [1] Female Male  
## Levels: Female Male
class(TCGA.CRC$Gender)
## [1] "factor"

Challenge

  • Get unique information and class for Cancer information
  • Get unique information and class for BRAF.mutation information
  • Get unique information and class for history_of_colon_polyps information

table function generates contingency table (or classification table with frequency of outcomes) at each combination.

## check how many female and male are in the dataset
table(TCGA.CRC$Gender)
## 
## Female   Male 
##     34     44
## check unique information if paticipant ID
unique(TCGA.CRC$TCGA.participant.ID)
##  [1] TCGA-A6-3807 TCGA-A6-3808 TCGA-A6-3810 TCGA-AA-3518 TCGA-AA-3525
##  [6] TCGA-AA-3526 TCGA-AA-3529 TCGA-AA-3531 TCGA-AA-3534 TCGA-AA-3552
## [11] TCGA-AA-3554 TCGA-AA-3558 TCGA-AA-3561 TCGA-AA-3664 TCGA-AA-3666
## [16] TCGA-AA-3672 TCGA-AA-3684 TCGA-AA-3695 TCGA-AA-3710 TCGA-AA-3715
## [21] TCGA-AA-3818 TCGA-AA-3848 TCGA-AA-3986 TCGA-AA-3989 TCGA-AA-A004
## [26] TCGA-AA-A00A TCGA-AA-A00F TCGA-AA-A00K TCGA-AA-A00N TCGA-AA-A00U
## [31] TCGA-AA-A010 TCGA-AA-A01D TCGA-AA-A01F TCGA-AA-A01I TCGA-AA-A01K
## [36] TCGA-AA-A01P TCGA-AA-A01R TCGA-AA-A01S TCGA-AA-A01T TCGA-AA-A01V
## [41] TCGA-AA-A01X TCGA-AA-A01Z TCGA-AA-A022 TCGA-AA-A024 TCGA-AA-A029
## [46] TCGA-AA-A02H TCGA-AA-A02O TCGA-AA-A02Y TCGA-AA-A03F TCGA-AA-A03J
## [51] TCGA-AF-2691 TCGA-AF-2692 TCGA-AF-3400 TCGA-AF-3913 TCGA-AG-3574
## [56] TCGA-AG-3580 TCGA-AG-3584 TCGA-AG-3593 TCGA-AG-A002 TCGA-AG-A008
## [61] TCGA-AG-A00C TCGA-AG-A00H TCGA-AG-A00Y TCGA-AG-A011 TCGA-AG-A014
## [66] TCGA-AG-A015 TCGA-AG-A01L TCGA-AG-A01W TCGA-AG-A01Y TCGA-AG-A020
## [71] TCGA-AG-A026 TCGA-AG-A02N TCGA-AG-A02X TCGA-AG-A032
## 74 Levels: TCGA-A6-3807 TCGA-A6-3808 TCGA-A6-3810 ... TCGA-AG-A032
length(unique(TCGA.CRC$TCGA.participant.ID))
## [1] 74

There are 74 unique participant IDs. but there are total 78 rows. We can suspect that there are multiple rows for some IDs. Let’s check. Get the count per participants ID

countID <- table(TCGA.CRC$TCGA.participant.ID)
countID
## 
## TCGA-A6-3807 TCGA-A6-3808 TCGA-A6-3810 TCGA-AA-3518 TCGA-AA-3525 
##            1            1            1            1            1 
## TCGA-AA-3526 TCGA-AA-3529 TCGA-AA-3531 TCGA-AA-3534 TCGA-AA-3552 
##            1            1            1            1            1 
## TCGA-AA-3554 TCGA-AA-3558 TCGA-AA-3561 TCGA-AA-3664 TCGA-AA-3666 
##            1            1            1            1            1 
## TCGA-AA-3672 TCGA-AA-3684 TCGA-AA-3695 TCGA-AA-3710 TCGA-AA-3715 
##            1            1            1            1            1 
## TCGA-AA-3818 TCGA-AA-3848 TCGA-AA-3986 TCGA-AA-3989 TCGA-AA-A004 
##            1            1            1            1            1 
## TCGA-AA-A00A TCGA-AA-A00F TCGA-AA-A00K TCGA-AA-A00N TCGA-AA-A00U 
##            2            1            2            2            1 
## TCGA-AA-A010 TCGA-AA-A01D TCGA-AA-A01F TCGA-AA-A01I TCGA-AA-A01K 
##            1            1            1            1            1 
## TCGA-AA-A01P TCGA-AA-A01R TCGA-AA-A01S TCGA-AA-A01T TCGA-AA-A01V 
##            1            1            1            1            1 
## TCGA-AA-A01X TCGA-AA-A01Z TCGA-AA-A022 TCGA-AA-A024 TCGA-AA-A029 
##            1            1            1            1            1 
## TCGA-AA-A02H TCGA-AA-A02O TCGA-AA-A02Y TCGA-AA-A03F TCGA-AA-A03J 
##            1            1            1            1            1 
## TCGA-AF-2691 TCGA-AF-2692 TCGA-AF-3400 TCGA-AF-3913 TCGA-AG-3574 
##            1            1            1            1            1 
## TCGA-AG-3580 TCGA-AG-3584 TCGA-AG-3593 TCGA-AG-A002 TCGA-AG-A008 
##            1            1            1            1            1 
## TCGA-AG-A00C TCGA-AG-A00H TCGA-AG-A00Y TCGA-AG-A011 TCGA-AG-A014 
##            1            2            1            1            1 
## TCGA-AG-A015 TCGA-AG-A01L TCGA-AG-A01W TCGA-AG-A01Y TCGA-AG-A020 
##            1            1            1            1            1 
## TCGA-AG-A026 TCGA-AG-A02N TCGA-AG-A02X TCGA-AG-A032 
##            1            1            1            1

Some participant IDs has 2 rows. Let’s get the subset that has more than one row.

unique(countID)
## [1] 1 2
countID[countID > 1]
## 
## TCGA-AA-A00A TCGA-AA-A00K TCGA-AA-A00N TCGA-AG-A00H 
##            2            2            2            2

4 participants have two rows per each. Let’s check the rows for parcipant ID = TCGA-AA-A00A

TCGA.CRC[TCGA.CRC$TCGA.participant.ID == 'TCGA-AA-A00A', ]
##    TCGA.participant.ID Gender Cancer BRAF.mutation history_of_colon_polyps
## 26        TCGA-AA-A00A   Male  Colon             0                      NO
## 27        TCGA-AA-A00A   Male  Colon             0                      NO

There are two rows with the same information. They are duplicated rows. Let’s remove them.

TCGA.CRC <- TCGA.CRC[!duplicated(TCGA.CRC), ]

Challenge

  • Check whether dimension and number of participants ID are changed after removing duplicated rows.

Generate 2-way contingency tables

table function also can calculate 2-way contingency table, which is frequency outcomes of two categorical variables. Let’s get the table and visualise the count data.

cancer.polyps <- table(TCGA.CRC$history_of_colon_polyps, TCGA.CRC$Cancer)
cancer.polyps
##      
##       Colon Rectum
##   NO     29     19
##   YES    21      5
dotchart(cancer.polyps, xlab = "Observed counts")

plot of chunk unnamed-chunk-55

Comparison of two proportions

Hypothesis in general :

\(H_0\) : each population has the same proportion of observations, \(\pi_{j=1|i=1} = \pi_{j=1|i=2}\)

\(H_a\) : different population have different proportion of observations.

Hypothesis that we are interested with this example: whether the proportion of patients who have history of colon polyps in the patients with colon cancer is different from that in the patients with rectal cancer.

polyps <- cancer.polyps[2, ]
cancer <- apply(cancer.polyps, 2, sum)
pt <- prop.test(polyps, cancer)
pt
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  polyps out of cancer
## X-squared = 2.3268, df = 1, p-value = 0.1272
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.03156841  0.45490175
## sample estimates:
##    prop 1    prop 2 
## 0.4200000 0.2083333
# name of output
names(pt)
## [1] "statistic"   "parameter"   "p.value"     "estimate"    "null.value" 
## [6] "conf.int"    "alternative" "method"      "data.name"
# proportion in each group
pt$estimate
##    prop 1    prop 2 
## 0.4200000 0.2083333
# test statistic value
pt$statistic
## X-squared 
##   2.32678
# degree of freedom
pt$parameter
## df 
##  1

Test of independence

Hypothesis in general :

\(H_0\) : the factors are independent.

\(H_a\) : the factors are not independent.

Hypothesis that we are interested with this example: whether history of colon polyps and type of colon cancer are independent or not.

Pearson Chi-squared test

\[\chi^2 =\sum_{i=1}^2 \sum_{j=1}^2 \frac{(O_{ij}-E_{ij})^2}{E_{ij}} \sim \chi^2_{(2-1)(2-1)}\]

\(O_{ij}\) : \(n_{ij}\), which is the count within the cells

\(E_{ij}\) : \(n_{i+}n_{+j}/n\), where \(n_{i+}\) is the row count sum, \(n_{+j}\) is the column count sum and n is the total count.

!! assumption : the distribution of the test statistic is approximately chi-square if the expected counts are large enough. Use this test only if the expected count for each cell is greater than 5.

chisq.test(cancer.polyps)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cancer.polyps
## X-squared = 2.3268, df = 1, p-value = 0.1272

Mathematically, two tests above are equivalent. prop.test() uses chisq.test() internally and print output differently.

Fisher’s exact test

Fisher’s exact test is a non-parametric test for independence. It compares the distributions of counts within 4 cells. p-value for Fisher’s exact test is the probability of obtaining more extreme data by chance than the data observed if the row and column totals are held fixed.

There are no assumptions on distributions or sample sizes for this test. Therefore, the Fisher’s exact test can be used with small sample sizes. However, if the sample sizes are very small, then the power of the test will be very low.

Apply the Fisher’s exact test using the fisher.test function and extract the odds ratio.

ft <- fisher.test(cancer.polyps)
ft
## 
##  Fisher's Exact Test for Count Data
## 
## data:  cancer.polyps
## p-value = 0.1177
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.09234132 1.24084405
## sample estimates:
## odds ratio 
##  0.3681978

and extract the odds ratio.

ft$estimate
## odds ratio 
##  0.3681978

Challenge

  • Compare the proportion of male patients in the patients with colon cancer is different from that in the patients with rectal cancer.

Optional practice : Large-sample Z-test

We could also apply a z-test for comparison of two proportions, defined as

\[Z=\frac{\widehat{p}_1-\widehat{p}_2}{\sqrt{\widehat{p} (1- \widehat{p}) (\frac{1}{n_1} + \frac{1}{n_2})}}\]

where \(\widehat{\pi}_1 = \frac{y_{1}}{n_1}\), \(\widehat{\pi}_2 = \frac{y_{2}}{n_2}\) and \(\widehat{p}=\frac{x_1 + x_2}{n_1 + n_2}\).

We are going to use this test to illustrate how to write functions in R.

An R function is created with the function constructor, named function, and is composed of:

  1. a name: we will call our function to calculate the p-value z.prop.p;
  2. some inputs: our inputs are the number of observations for the outcome of interest, x1 and x2, and the total number of observation in each category, n1 and n2;
  3. a returned value (output): the computed p-value, named pvalue;
  4. a body, i.e. the code that, given the inputs, calculates the output.
z.prop.p <- function(x1, x2, n1, n2) {
    pi_1 <- x1/n1
    pi_2 <- x2/n2
    numerator <- pi_1 - pi_2
    p_common <- (x1+x2)/(n1+n2)
    denominator <- sqrt(p_common * (1-p_common) * (1/n1 + 1/n2))
    stat <- numerator/denominator
    pvalue <- 2 * (1 - pnorm(abs(stat)))
    return(pvalue)
}

z.prop.p(cancer.polyps[2,1], cancer.polyps[2,2], sum(cancer.polyps[,1]), sum(cancer.polyps[,2]))
## [1] 0.07418571

Challenge

Write a function named f2c (c2f) that converts a temperature from Fahrenheit to Celsium (Celsium to Fahrenheit) using the following formula \(F = C \times 1.8 + 32\) (\(C = \frac{F - 32}{1.8}\)).


Part 4: Linear models and correlation

When considering correlations and modelling data, visualisation is key.

Let’s use the famous Anscombe’s quartet data as a motivating example. This data is composed of 4 pairs of values, \((x_1, y_1)\) to \((x_4, y_4)\):

x1 x2 x3 x4 y1 y2 y3 y4
10 10 10 8 8.04 9.14 7.46 6.58
8 8 8 8 6.95 8.14 6.77 5.76
13 13 13 8 7.58 8.74 12.74 7.71
9 9 9 8 8.81 8.77 7.11 8.84
11 11 11 8 8.33 9.26 7.81 8.47
14 14 14 8 9.96 8.10 8.84 7.04
6 6 6 8 7.24 6.13 6.08 5.25
4 4 4 19 4.26 3.10 5.39 12.50
12 12 12 8 10.84 9.13 8.15 5.56
7 7 7 8 4.82 7.26 6.42 7.91
5 5 5 8 5.68 4.74 5.73 6.89

Each of these \(x\) and \(y\) sets have the same variance, mean and correlation:

1 2 3 4
var(x) 11.0000000 11.0000000 11.0000000 11.0000000
mean(x) 9.0000000 9.0000000 9.0000000 9.0000000
var(y) 4.1272691 4.1276291 4.1226200 4.1232491
mean(y) 7.5009091 7.5009091 7.5000000 7.5009091
cor(x,y) 0.8164205 0.8162365 0.8162867 0.8165214

But…

While the residuals of the linear regression clearly indicate fundamental differences in these data, the most simple and straightforward approach is visualisation to highlight the fundamental differences in the datasets.

plot of chunk anscombefig

See also another, more recent example: The Datasaurus Dozen dataset.

The Datasaurus Dozen dataset

Correlation

Here is an example where a wide format comes very handy. We are going to convert our iPRG data using the spread function from the tidyr package:

library("tidyr")
iprg2 <- spread(iprg[, 1:3], Run, Log2Intensity)
rownames(iprg2) <- iprg2$Protein
iprg2 <- iprg2[, -1]

And lets focus on the 3 runs, i.e. 2 replicates from condition 1 and

x <- iprg2[, c(1, 2, 10)]
head(x)
##                       JD_06232014_sample1-A.raw JD_06232014_sample1_B.raw
## sp|D6VTK4|STE2_YEAST                   26.58301                  26.81232
## sp|O13297|CET1_YEAST                   24.71809                  24.71912
## sp|O13329|FOB1_YEAST                   23.47075                  23.37678
## sp|O13539|THP2_YEAST                   24.29661                  27.52021
## sp|O13547|CCW14_YEAST                  27.11638                  27.22234
## sp|O13563|RPN13_YEAST                  26.17056                  26.09476
##                       JD_06232014_sample4-A.raw
## sp|D6VTK4|STE2_YEAST                   26.65573
## sp|O13297|CET1_YEAST                   24.50814
## sp|O13329|FOB1_YEAST                   23.03473
## sp|O13539|THP2_YEAST                   25.07576
## sp|O13547|CCW14_YEAST                  27.07526
## sp|O13563|RPN13_YEAST                  25.77958
pairs(x)

plot of chunk unnamed-chunk-64

We can use the cor function to calculate the Pearson correlation between two vectors of the same length (making sure the order is correct), or a dataframe. But, we have missing values in the data, which will stop us from calculating the correlation:

cor(x)
##                           JD_06232014_sample1-A.raw
## JD_06232014_sample1-A.raw                         1
## JD_06232014_sample1_B.raw                        NA
## JD_06232014_sample4-A.raw                        NA
##                           JD_06232014_sample1_B.raw
## JD_06232014_sample1-A.raw                        NA
## JD_06232014_sample1_B.raw                         1
## JD_06232014_sample4-A.raw                        NA
##                           JD_06232014_sample4-A.raw
## JD_06232014_sample1-A.raw                        NA
## JD_06232014_sample1_B.raw                        NA
## JD_06232014_sample4-A.raw                         1

We first need to omit the proteins/rows that contain missing values

x2 <- na.omit(x)
cor(x2)
##                           JD_06232014_sample1-A.raw
## JD_06232014_sample1-A.raw                 1.0000000
## JD_06232014_sample1_B.raw                 0.9794954
## JD_06232014_sample4-A.raw                 0.9502142
##                           JD_06232014_sample1_B.raw
## JD_06232014_sample1-A.raw                 0.9794954
## JD_06232014_sample1_B.raw                 1.0000000
## JD_06232014_sample4-A.raw                 0.9502517
##                           JD_06232014_sample4-A.raw
## JD_06232014_sample1-A.raw                 0.9502142
## JD_06232014_sample1_B.raw                 0.9502517
## JD_06232014_sample4-A.raw                 1.0000000

A note on correlation and replication

It is often assumed that high correlation is a halmark of good replication. Rather than focus on the correlation of the data, a better measurement would be to look a the log2 fold-changes, i.e. the distance between or repeated measurements. The ideal way to visualise this is on an MA-plot:

par(mfrow = c(1, 2))
r1 <- x2[, 1]
r2 <- x2[, 2]
M <- r1 - r2
A <- (r1 + r2)/2
plot(A, M); grid()
suppressPackageStartupMessages(library("affy"))
affy::ma.plot(A, M)

plot of chunk unnamed-chunk-67

See also this post on the Simply Statistics blog.

Linear modelling

abline(0, 1) can be used to add a line with intercept 0 and slop 1. It we want to add the line that models the data linearly, we can calculate the parameters using the lm function:

lmod <- lm(r2 ~ r1)
summary(lmod)
## 
## Call:
## lm(formula = r2 ~ r1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4939 -0.0721  0.0126  0.0881  3.4595 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.348190   0.091842   3.791 0.000153 ***
## r1          0.985878   0.003688 267.357  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3263 on 3024 degrees of freedom
## Multiple R-squared:  0.9594, Adjusted R-squared:  0.9594 
## F-statistic: 7.148e+04 on 1 and 3024 DF,  p-value: < 2.2e-16

which can be used to add the adequate line that reflects the (linear) relationship between the two data

plot(r1, r2)
abline(lmod, col = "red")

plot of chunk unnamed-chunk-69

As we have seen in the beginning of this section, it is essential not to rely solely on the correlation value, but look at the data. This also holds true for linear (or any) modelling, which can be done by plotting the model:

par(mfrow = c(2, 2))
plot(lmod)

plot of chunk unnamed-chunk-70

  • Cook’s distance is a commonly used estimate of the influence of a data point when performing a least-squares regression analysis and can be used to highlight points that particularly influence the regression.

  • Leverage quantifies the influence of a given observation on the regression due to its location in the space of the inputs.

See also ?influence.measures.

Challenge

  1. Take any of the iprg2 replicates, model and plot their linear relationship. The iprg2 data is available as an rda file, or regenerate it as shown above.
  2. The Anscombe quartet is available as anscombe. Load it, create a linear model for one \((x_i, y_i)\) pair of your choice and visualise/check the model.
x3 <- anscombe[, 3]
y3 <- anscombe[, 7]
lmod <- lm(y3 ~ x3)
summary(lmod)
## 
## Call:
## lm(formula = y3 ~ x3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1586 -0.6146 -0.2303  0.1540  3.2411 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0025     1.1245   2.670  0.02562 * 
## x3            0.4997     0.1179   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176
par(mfrow = c(2, 2))
plot(lmod)

plot of chunk unnamed-chunk-71

Finally, let’s conclude by illustrating how ggplot2 can very elegantly be used to produce similar plots, with useful annotations:

library("ggplot2")
dfr <- data.frame(r1, r2, M, A)
p <- ggplot(aes(x = r1, y = r2), data = dfr) + geom_point()
p + geom_smooth(method = "lm") +
    geom_quantile(colour = "red")

plot of chunk unnamed-chunk-72

Challenge

Replicate the MA plot above using ggplot2. Then add a non-parametric lowess regression using geom_smooth().

p <- ggplot(aes(x = A, y = M), data = dfr) + geom_point()
p + geom_smooth() + geom_quantile(colour = "red")
## `geom_smooth()` using method = 'gam'
## Smoothing formula not specified. Using: y ~ x

plot of chunk unnamed-chunk-73


Back to course home page